PL-Net: progressive learning network for medical image segmentation

In recent years, deep convolutional neural network-based segmentation methods have achieved state-of-the-art performance for many medical analysis tasks. However, most of these approaches rely on optimizing the U-Net structure or adding new functional modules, which overlooks the complementation and fusion of coarse-grained and fine-grained semantic information. To address these issues, we propose a 2D medical image segmentation framework called Progressive Learning Network (PL-Net), which comprises Internal Progressive Learning (IPL) and External Progressive Learning (EPL). PL-Net offers the following advantages: 1) IPL divides feature extraction into two steps, allowing for the mixing of different size receptive fields and capturing semantic information from coarse to fine granularity without introducing additional parameters; 2) EPL divides the training process into two stages to optimize parameters and facilitate the fusion of coarse-grained information in the first stage and fine-grained information in the second stage. We conducted comprehensive evaluations of our proposed method on five medical image segmentation datasets, and the experimental results demonstrate that PL-Net achieves competitive segmentation performance. It is worth noting that PL-Net does not introduce any additional learnable parameters compared to other U-Net variants.


Introduction
The purpose of medical image segmentation is to extract regions of interest, providing a basis for quantitative and qualitative analysis and laying the foundation for 3D visualization. Deep learning has brought substantial progress to medical image segmentation methods. For example, Shelhamer et al. [1] proposed fully convolutional networks (FCN), which eliminate manual feature extraction and image postprocessing and focus on the design of an end-to-end network structure. Inspired by FCN, Ronneberger et al. [2] proposed U-Net, which made a breakthrough in medical image segmentation. Its encoder-decoder structure with skip connections inspired a large number of improved U-Net-based medical image segmentation methods [3,4,5].
At present, most semantic segmentation methods assume that the entire segmentation process can be performed through a single feed-forward pass over the input image, during which global information is easily ignored. To this end, researchers add new functional modules to U-Net or optimize its structure to obtain performance improvements. These methods can be divided into (1) functional optimization-oriented U-Net variants and (2) structural optimization-oriented U-Net variants.
Functional optimization-oriented U-Net variants. Since medical images contain many irrelevant features, it is very important to focus on target features and suppress irrelevant ones during segmentation. Recent works extend U-Net with new functional modules, illustrating its potential in various visual analysis tasks. Squeeze-and-Excitation (SE) [6] automatically learns the importance of each feature channel, and this attention mechanism has benefited U-Net [7]. Moreover, ScSE [8] and FCANet [9] study the influence of spatial information on segmentation and integrate concurrent spatial and channel attention into U-Net to improve segmentation performance. Oktay et al. [4] proposed an attention gate for medical imaging that automatically learns to focus on target structures of different shapes and sizes, suppressing irrelevant areas of the input image and emphasizing features useful for the specific task. Ni et al. [10] designed an attention module to learn discriminative features and address specular reflection. Zhou et al. [11] proposed a contour-aware information aggregation network containing a multilevel information aggregation module between two task-specific decoders. SAUNet [5] uses a secondary shape stream alongside the regular texture stream to capture rich shape-related information in parallel, which allows multi-level interpretation of the network and reduces the need for additional computation afterwards. CENet [12] uses a dense atrous convolution (DAC) block to extract a rich feature representation and a residual multi-kernel pooling (RMP) operation to further encode the multi-scale context features extracted by the DAC block without additional learnable weights. Unlike the U-Net variants mentioned above, PL-Net introduces no additional functional modules. We capture coarse-grained and fine-grained semantic information through progressive learning and reuse the learned features to improve on the performance of single-stage U-Net.
Structural optimization-oriented U-Net variants. In contrast to the functional optimization-oriented variants, these methods extract feature information at different levels by optimizing the network structure, which is feasible and effective for many computer vision problems. ResU-Net [13], DSNet [14] and FCDenseNet [15] replace the basic U-Net blocks with residual or densely connected blocks to improve feature mining. W-Net [16] draws on supervised semantic segmentation methods and addresses unsupervised segmentation by connecting two U-Nets in an auto-encoder style model. DAGAN [17] combines the characteristics of the above two structural optimizations in a generative adversarial network (GAN) model that includes an encoder-decoder segmentation module and a dual discriminator module; skip connections and dense convolution blocks are used to obtain discriminative feature representations, and the model is trained end to end. In addition to replacing basic blocks, performance on different tasks can be improved by increasing the number of U-shaped network structures, as verified in the literature [18,19,20,21]. Nested or recursive iteration can refine the features extracted at different stages. U-Net++ [3] redesigns the U-Net structure with a series of nested and dense skip connections, reducing the semantic gap between the feature maps of the encoder and decoder subnetworks. Wang et al. proposed RUNet [22], which repeatedly connects multiple encoder-decoder pairs of U-Net to enhance its discriminative ability for semantic segmentation, but this method introduces an additional learnable block. R2U-Net [23] follows a similar approach to RUNet and combines the advantages of U-Net, residual networks [24] and RCNN [25] to achieve better medical image segmentation performance. Xiang et al. designed BiO-Net [26] with backward skip connections on the basis of RUNet, which reuses the features of each decoding level to achieve more intermediate information aggregation. BiO-Net allows U-Net to reuse building blocks in a recurrent manner without introducing any additional parameters.
In this research, we take another stand on medical image segmentation. Firstly, we divide feature learning into two "steps" in U-Nets of different depths. Through this internal progressive learning strategy, receptive fields of different sizes are obtained, so that the network learns semantic information at different granularities. Then, we perform the entire segmentation process through multiple feed-forward passes. More specifically, we divide the training into two "stages". When each training stage ends, the features obtained from the current training stage are passed to the next for fusion. This transfer operation essentially enables the model to mine more fine-grained information based on what it learned in the previous training stage, which is simple but effective for refining the coarse segmentation output. It lets the model grasp the overall structure from the coarse segmentation output and, at the same time, refine the segmentation details after each stage without control problems [27].
Figure 1: Overview of the progressive learning network.
Compared with previous works, our method adds no additional parameters or functional modules to U-Net. Building blocks are reused by means of internal progressive learning to better refine features, and the external progressive learning strategy is used to optimize the parameters of each stage and to fuse coarse-grained and fine-grained information. The features extracted from all stages are connected only in the last step to further ensure that the complementary relationship among features is fully explored. The overview of PL-Net is shown in Fig 1. Since the progressive learning strategy allows the features of the previous "steps" and "stages" to be reused, PL-Net depends only weakly on the number of channels. By adjusting the output channel scale (Ocs), we designed a standard PL-Net (15.03 M) and a smaller version, PL-Net † (Ocs = 0.5, 3.77 M). We apply the proposed method to skin lesion segmentation and nucleus segmentation tasks. The experimental results show that the standard PL-Net is superior to other state-of-the-art methods such as U-Net and BiO-Net. In addition, although PL-Net † contains a small number of parameters, it still maintains good segmentation performance, which is directly related to the progressive learning strategy we propose.

Progressive Learning Network
We now describe PL-Net, a progressive learning framework for medical image segmentation. As shown in Fig 1, PL-Net is a multi-level U-Net architecture that does not rely on additional functional modules but has paired bidirectional connections. The core of our framework is to enhance the feature representation required for image segmentation through two progressive learning approaches (internal and external) and to fuse coarse-grained and fine-grained semantic information.
Two U-Nets of different depths form the different learning "stages". At each stage, as the "steps" increase, the shallow network is expanded into a deeper one, from which stable multi-granularity information is learned. In brief, internal progressive learning does not increase the number of parameters but learns feature maps with receptive fields of different sizes; external progressive learning is divided into the coarse segmentation stage (stage 1) and the refined segmentation stage (stage 2). The input image is thus examined at multiple scales to fuse coarse-grained and fine-grained information.

Internal Progressive Learning
Internal progressive learning uses bidirectional skip connections to reuse building blocks. To enable the network at each stage to learn distinctive feature representations, we use two "steps" to gradually mine the features from shallow to deep.
Forward Skip Connections (FSC) are used to assist up-sampling learning, restore the contour of the segmentation target, and retain the low-level visual features of the encoder. The feature $f_{FSC}^{s}$ after FSC can be expressed as: $f_{FSC}^{s} = C^{s}([x, \hat{x}])$ (1). Here, "[•]" is the concatenation layer, $C^{s}$ denotes the convolution operation of the s-th "step" ($s \in \{1,2\}$) applied to the input feature map, and $x$ and $\hat{x}$ are feature maps of the same size in the down-sampling and up-sampling paths respectively.
Unlike FSC, Backward Skip Connections (BSC) are used for flexible aggregation of low-level visual features and high-level semantic features. To realize the complementation and fusion of semantic information at different stages, the BSC feature combines multi-granularity information from different "steps" and "stages". It is worth noting that the inference path of internal progressive learning can be extended to multiple recursions to obtain immediate performance gains. More importantly, each output of this learning strategy has a larger receptive field than that of the previous "step". We index the complete encoding-decoding processes by i and denote the output of the i-th process accordingly. In this study, we use two such processes, which keeps the number of parameters equivalent to that of BiO-Net. In future research, more than two processes can be used to further improve the segmentation accuracy, but the exploration of the optimal hyperparameter setting is beyond the scope of this paper.
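The claim that reusing building blocks enlarges the receptive field without adding parameters can be illustrated with a small calculation. The sketch below is a simplification, not the PL-Net implementation: it ignores pooling, up-sampling and skip connections, and the block layout (two 3x3 convolutions) and the channel width of 64 are assumptions chosen only for illustration.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of a stack of (kernel, stride) layers.

    Uses the standard recurrence r_out = r_in + (k - 1) * jump and
    jump_out = jump * stride.
    """
    r, jump = 1, 1
    for kernel, stride in layers:
        r += (kernel - 1) * jump
        jump *= stride
    return r


# Hypothetical building block: two 3x3 convolutions with stride 1.
BLOCK = [(3, 1), (3, 1)]
# Assumed 64 input/output channels, weights plus biases for both convolutions.
PARAMS_PER_BLOCK = 2 * (3 * 3 * 64 * 64 + 64)


def reuse(block, steps):
    """Apply the same block `steps` times with shared weights.

    The receptive field grows with every pass, while the parameter count
    stays that of a single block because no new weights are created.
    """
    return receptive_field(block * steps), PARAMS_PER_BLOCK


rf_one, params_one = reuse(BLOCK, steps=1)
rf_two, params_two = reuse(BLOCK, steps=2)
print(rf_one, rf_two, params_one == params_two)
```

Under these assumptions the second pass roughly doubles the spatial context seen by each output pixel, which is the effect the internal progressive learning strategy relies on.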

External Progressive Learning
The external progressive learning strategy first trains the low stage (stage1) and then gradually trains toward the high stage (stage2). Since "stage1" is relatively shallow and limited in receptive field and representational capacity, it focuses on local information extraction, while "stage2" incorporates the local texture information learned from "stage1". Compared with directly training the entire network in series, this incremental scheme allows the model to pay attention to global information as the features gradually enter a higher stage.
For each stage of training, we calculate the loss based on the Dice coefficient ($\mathcal{L}_{dice}$) [28] between the ground truth ($y$) and the predicted probability distribution ($p^{n}$) of each stage: $\mathcal{L}_{dice}^{n} = 1 - \frac{2\,|y \cap p^{n}|}{|y| + |p^{n}|}$ (4). Here "|•|" denotes the number of pixels in the given region. In each iteration, the input data is used in every learning stage (where $n \in \{1,2\}$). Note that when the later stage is predicted, all the parameters of the previous stage are also optimized and updated, which helps the stages of the model work together.
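As an illustration of the per-stage Dice loss described above, the following sketch computes a soft Dice loss on flattened masks. It is a minimal version under stated assumptions: the smoothing term `eps` and the toy probabilities are not from the text, and the intersection is approximated by the element-wise product, as is common for soft Dice.

```python
def dice_loss(y_true, y_prob, eps=1e-7):
    """Soft Dice loss between a binary mask and predicted probabilities.

    Both inputs are flat sequences of equal length. `eps` is an assumed
    smoothing term that avoids division by zero on empty masks.
    """
    intersection = sum(t * p for t, p in zip(y_true, y_prob))
    total = sum(y_true) + sum(y_prob)
    return 1.0 - (2.0 * intersection + eps) / (total + eps)


# Toy example: one ground-truth mask and the predictions of two stages.
mask = [1, 1, 0, 0]
stage_probs = {
    1: [0.6, 0.4, 0.3, 0.1],  # coarse stage (hypothetical values)
    2: [0.9, 0.8, 0.1, 0.0],  # refined stage (hypothetical values)
}
losses = {stage: dice_loss(mask, probs) for stage, probs in stage_probs.items()}
print(losses)
```

In this toy case the refined stage has the lower loss, and each stage contributes its own loss term during training.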
Since the low stage mainly assists the feature expression and knowledge reasoning of the high-stage network, we can remove the low-stage prediction layer (Sigmoid layer) at prediction time, thereby reducing the inference time. In addition, the predictions at different stages are distinct, but they provide complementary information across different granularities. Combining all outputs with equal weights results in better performance. In other words, the final output of the model is determined jointly by all stages.
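One way to read the equal-weight combination of stage outputs is a per-pixel average of the stage probability maps followed by thresholding. The sketch below illustrates only that reading; the averaging rule, the 0.5 threshold and the toy values are assumptions, not the formula from the paper.

```python
def fuse_equal_weight(stage_outputs):
    """Average the per-pixel probabilities of all stages with equal weight."""
    n_stages = len(stage_outputs)
    return [sum(values) / n_stages for values in zip(*stage_outputs)]


def binarize(probs, threshold=0.5):
    """Turn fused probabilities into a binary mask (assumed threshold)."""
    return [1 if p >= threshold else 0 for p in probs]


# Hypothetical flattened outputs of the coarse and refined stages.
stage1 = [0.7, 0.4, 0.2]
stage2 = [0.9, 0.8, 0.1]

fused = fuse_equal_weight([stage1, stage2])
fused_mask = binarize(fused)
print(fused, fused_mask)
```

The second pixel shows the complementary effect: the coarse stage alone would reject it, but the fused prediction keeps it.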

PL-Net Architecture
Our framework offers a trade-off between performance and parameters. Like U-Net, PL-Net uses only standard convolutional layers, batch normalization layers and ReLU layers in the down-sampling and up-sampling stages, without introducing any additional functional module. Table 1 gives the detailed configuration of U-Net, BiO-Net and our PL-Net.
Table 1: Detailed configuration of U-Net, BiO-Net, and our PL-Net architecture.
As shown in Table 1, BiO-Net has a maximum encoding depth of 4, uses BSC from the deepest encoding level, and feeds the decoded features of each iteration into the last-stage block. PL-Net also uses BSC. Unlike the previous methods, our model allows the convolutional layers to mine features progressively from coarse-grained to fine-grained. Note that the smaller version PL-Net † is obtained solely by adjusting the Ocs; its depth and connection scheme do not change.
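The effect of the output channel scale on model size can be checked with simple arithmetic: the weights of a convolution grow with the product of input and output channels, so halving every channel count cuts the parameters to roughly a quarter, in line with the reported 15.03 M and 3.77 M (ratio of about 3.99). The sketch below counts the parameters of a hypothetical four-level encoder; the base width of 64, the two 3x3 convolutions per level and the 3-channel input are assumptions and do not reproduce Table 1.

```python
def conv_params(c_in, c_out, kernel=3):
    """Weights plus biases of one 2D convolution."""
    return kernel * kernel * c_in * c_out + c_out


def encoder_params(base_channels, ocs, in_channels=3, depth=4):
    """Parameter count of a toy encoder with two convolutions per level.

    Channel widths double at every level and are scaled by `ocs`.
    """
    total, c_in = 0, in_channels
    for level in range(depth):
        c_out = int(base_channels * ocs * 2 ** level)
        total += conv_params(c_in, c_out) + conv_params(c_out, c_out)
        c_in = c_out
    return total


full = encoder_params(64, ocs=1.0)
small = encoder_params(64, ocs=0.5)
ratio = full / small
print(full, small, round(ratio, 3))
```

The ratio stays slightly below 4 because the first convolution (fixed input channels) and the biases scale only linearly with the channel width.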

Datasets
ISIC 2017 [29]. This dataset consists of 2000 training images, 150 validation images and 600 test images. The images in the original dataset provided by ISIC have different resolutions. We first normalize the colors of the images with the gray-world color constancy algorithm and then resize all images to 224×224 pixels. All experimental results on this dataset reported in the paper are obtained on the official test set. Sample images are shown in Fig 2 (a).
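The gray-world normalization mentioned above rescales each color channel so that its mean matches the overall mean intensity of the image. The following sketch shows the basic form of the algorithm on a list of RGB pixels; the exact variant used in the experiments is not specified, so the clipping to 255 and the handling of empty channels are assumptions.

```python
def gray_world(pixels):
    """Gray-world color constancy on a list of (r, g, b) pixels in [0, 255].

    Each channel is multiplied by a gain that moves its mean to the global
    mean over all three channels.
    """
    n = len(pixels)
    means = [sum(p[c] for p in pixels) / n for c in range(3)]
    gray = sum(means) / 3.0
    gains = [gray / m if m > 0 else 1.0 for m in means]
    return [
        tuple(min(255.0, p[c] * gains[c]) for c in range(3))
        for p in pixels
    ]


# Toy image with a strong red cast (hypothetical values).
image = [(200, 100, 50), (100, 50, 25)]
balanced = gray_world(image)
print(balanced)
```

After correction the three channel means coincide, which removes the global color cast that differs between acquisition devices.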
Kaggle 2018 data science bowl (referred to as Nuclei) [30]. This dataset is provided by the Booz Allen Foundation; it contains 670 nuclei images and provides a ground truth for each image. We resize all images and the corresponding ground truth to 224×224 pixels, use 80% of the images for training and the rest for testing, and perform 5-fold cross-validation.
PH2 [31]. The PH2 dataset contains 200 dermoscopic images with a fixed size of 768×560 pixels. The dataset contains 80% benign mole cases and 20% melanoma cases, with ground truth annotated by dermatologists. Due to the small scale of the dataset, we apply the preprocessing method of the ISIC 2017 dataset and use the model trained on ISIC 2017 to directly predict all images of the dataset, in order to evaluate the performance of different models.
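The 80/20 protocol with 5-fold cross-validation described for the Nuclei dataset can be sketched as a simple index split. This is a generic illustration; the shuffling seed and the way the folds were actually drawn are not given in the text and are assumed here.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Split sample indices into k (train, test) pairs.

    Each fold holds out a disjoint 1/k share of the data for testing and
    trains on the remaining indices. The seed is an assumption.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    fold_size = n_samples // k
    folds = []
    for i in range(k):
        start = i * fold_size
        # The last fold absorbs any remainder.
        end = start + fold_size if i < k - 1 else n_samples
        test = indices[start:end]
        train = indices[:start] + indices[end:]
        folds.append((train, test))
    return folds


folds = k_fold_indices(670, k=5)
for train, test in folds:
    print(len(train), len(test))
```

With 670 images, each fold trains on 536 images and tests on 134, and every image is used for testing exactly once.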

Implementation Details
We perform all experiments on a Tesla V100 GPU with Keras and expand the training dataset by applying random rotation (±25°), random horizontal and vertical shifts (15%) and random flips (horizontal and vertical). For all models, we train for more than 200 epochs with a batch size of 16, a fixed learning rate of 1e-4 and the Adam [32] optimizer with a momentum of 0.9 to minimize the Dice loss. We use early stopping and stop training when the validation loss has stabilized with no significant change for 20 epochs. Unless explicitly specified, the number of "steps" and "stages" of PL-Net is 2, and BSC are established at each stage of the network. At test time, all prediction layers before the last "stage" are removed, and other configurations remain unchanged.
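The early stopping rule described above can be written down in a few lines. The sketch below is framework-independent and only illustrates the logic; the `min_delta` parameter standing in for "no significant change" and its default of 0 are assumptions.

```python
def early_stop_epoch(val_losses, patience=20, min_delta=0.0):
    """Return the 1-based epoch at which training stops, or None.

    Training stops once the validation loss has failed to improve on the
    best value by more than `min_delta` for `patience` consecutive epochs.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Hypothetical history: the loss improves for 3 epochs and then plateaus.
history = [1.0, 0.8, 0.6] + [0.6] * 25
print(early_stop_epoch(history))
```

In Keras, comparable behavior is usually obtained with the `EarlyStopping` callback monitoring the validation loss with a patience of 20, although the exact callback settings used in the experiments are not stated.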

Ablation Study
We conduct ablation studies to understand the effectiveness of the internal and external progressive learning strategies. Without the internal progressive learning strategy, the model extracts features by simply stacking basic blocks, and we experiment with 1 and 2 layers respectively. Adopting the internal progressive learning strategy means that the encoder-decoder is iterated several times in each stage; we test 2 and 3 iterations. Without external progressive learning, PL-Net connects the different "stages" in series to transfer the feature information learned in each stage; only the parameters of the last stage are optimized, and the model outputs the segmentation result from that stage. With external progressive learning, in contrast, the feature information obtained in the current training "stage" is passed to the next training "stage" and fused, so the model can mine fine-grained information based on what it learned in the previous training "stage". Table 2 shows the IoU (Dice) without and with the progressive learning strategies on three medical image segmentation datasets. We also report the number of parameters and the model size of each configuration to analyze the segmentation performance comprehensively. In most cases, PL-Net obtains the best segmentation performance when internal (2 iterations) and external progressive learning are used together. Compared with the model with the same parameter settings but without IPL, the segmentation performance improves significantly. These results support the effectiveness of EPL and IPL. To trade off performance against parameters, we use 2 iterations in the following experiments. We believe that 3 iterations may be more effective as the amount of data increases.
In addition to the above ablation studies, we also consider the impact of the output channel scale (Ocs) on segmentation performance on different datasets. Fig 3 shows the experimental results on three datasets. We set Ocs ∈ [0.5, 2.0] and take values at intervals of 0.25. Note that Ocs = 0.5 corresponds to the smaller version, PL-Net †. When Ocs = 1.0, the best segmentation result is obtained with a well-balanced number of parameters (15.03 M). When Ocs > 1.0, the segmentation performance improves as the number of channels increases, but it does not reach that of the standard PL-Net. We attribute this to the limited data size and the complexity of the segmentation content. The segmentation performance of PL-Net † is slightly lower than that of other networks, but it has very few parameters. It is therefore recommended for small datasets, and it can be deployed on servers or mobile devices with limited hardware resources.
ISIC2017 and PH2 datasets. In these two tasks, we compare our PL-Net with the baseline U-Net and other state-of-the-art methods [2,3,4,7,9,14,17,21,23,26,33,34]. Among them, the functional optimization-oriented variants of U-Net are [4,7,9,14,23,34], and the structural optimization-oriented variants of U-Net are [3,14,17,21,23,26]. To be fair, we either use the experimental results reported by the authors on the same test set or run their released models in the same environment.

Quantitative Comparison
Table 3 reports the accuracy (Acc), intersection over union (IoU), Dice coefficient (Dice), sensitivity (Sens) and specificity (Spec) of different segmentation methods on the ISIC 2017 and PH2 datasets. PL-Net has the best performance on two of the evaluation indicators on the ISIC 2017 dataset: its IoU and Dice are 0.6% and 0.3% higher than those of BiO-Net (t = 3, INT) respectively, and the much smaller PL-Net † (3.77 M) reaches the same level of Dice as BiO-Net (t = 3) (14.30 M). The PH2 dataset is also a dermoscopic image segmentation task. Compared with state-of-the-art methods, PL-Net shows excellent performance on the PH2 dataset. Given that the PH2 images are predicted directly by the model trained on ISIC 2017, the better performance of PL-Net on this dataset suggests that our method is more versatile. In addition, PL-Net † achieves segmentation performance comparable to U-Net and remains competitive with other models despite having far fewer parameters.
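For reference, the five evaluation indicators reported above can all be derived from the pixel-wise confusion counts. The sketch below uses the standard definitions on flattened binary masks; the toy masks are hypothetical and the convention of returning 0 for an empty denominator is an assumption.

```python
def segmentation_metrics(y_true, y_pred):
    """Acc, IoU, Dice, Sens and Spec from two flat binary masks."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def safe_div(a, b):
        # Assumed convention: a metric with an empty denominator is 0.
        return a / b if b else 0.0

    return {
        "Acc": safe_div(tp + tn, tp + tn + fp + fn),
        "IoU": safe_div(tp, tp + fp + fn),
        "Dice": safe_div(2 * tp, 2 * tp + fp + fn),
        "Sens": safe_div(tp, tp + fn),
        "Spec": safe_div(tn, tn + fp),
    }


# Toy masks: 3 target pixels, one missed and one false alarm.
truth = [1, 1, 1, 0, 0, 0, 0, 0]
pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = segmentation_metrics(truth, pred)
print(metrics)
```

Note that Dice is never smaller than IoU for the same prediction, which is why the Dice gains in the comparison are smaller than the IoU gains.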
Nuclei segmentation. The feature distribution of this dataset is uneven, and the shapes of positive and negative samples differ greatly. Table 4 shows the quantitative comparison results, in which we compare our method with 12 others. Compared with the recent TransAttUnet-R [35], PL-Net achieves better overall segmentation performance, with relative gains of 0.27% to 1.16% across the evaluation indicators. The segmentation performance of U-Net++ lies between that of our PL-Net † and PL-Net, with IoU and Dice of 85.56% and 91.59% respectively. Across the five-fold cross-validation experiments, the standard PL-Net is more stable than PL-Net †, with a standard deviation that is relatively 14% lower. Although Att R2U-Net has a higher Dice than PL-Net (92.15 vs. 92.13), its overall performance and stability are slightly worse. It is worth noting that both PL-Net and BiO-Net use BSC, but our method has better overall performance, and even the smaller PL-Net † obtains almost the same IoU and Dice as BiO-Net (t = 3, INT).
The above quantitative comparison shows that the proposed network can be applied to different segmentation tasks. PL-Net produces better segmentation results even for challenging images (such as nuclei segmentation). Although the overall segmentation performance of PL-Net † is not as good as that of the standard PL-Net, its smaller parameter count and model size favor its application in memory-constrained environments. In addition, the experimental results show that both functional and structural optimization-oriented U-Net variants can improve on the original U-Net to a certain extent, but an increase in computational overhead is difficult to avoid. PL-Net adds a progressive learning strategy on top of U-Net, which gives it a good trade-off between segmentation performance and parameters.

Qualitative Comparison
Figure 4: Qualitative segmentation results on the ISIC 2017, Nuclei and PH2 datasets using different methods.
To better understand the good performance of our method, we show example results of PL-Net and several other state-of-the-art methods [2,3,4,9,23,26,32] in Fig 4. PL-Net and PL-Net † handle different types of targets and produce accurate segmentation results.
The 1st and 2nd rows of Fig 4 show the segmentation results for an ambiguous target area and for a small amount of occlusion (hair) respectively. Although the results produced by PL-Net are not perfectly accurate, our method is more effective than the other methods for areas with ambiguous targets. When segmenting occluded images, other models are prone either to incorrect boundary delineation or to mistaking the occluded areas for target areas. The segmentation target in the 3rd row is clear, and other methods also produce relatively accurate results. However, in the area marked by the red box, most methods mistake interfering pixels for target pixels, whereas our method produces a better result. The 4th row shows the performance of different models on tiny targets and dispersed structures. U-Net and Att U-Net either recognize the salient area as the target area or lose the target area, producing poor segmentation results. The 5th and 6th rows show the segmentation results of different methods for smaller and larger targets. Our model makes a good decision on the boundary of the small target, while other models cannot segment the area marked by the red box well. In contrast to the 5th row, the lesion area in the 6th row covers almost the entire image. Although other methods produce fairly accurate segmentation results, our PL-Net produces a more precise result in the area marked by the red box.

Conclusion
In this study, we propose a new variant of U-Net called PL-Net for medical image segmentation, which is mainly composed of internal and external progressive learning strategies. Compared with functional or structural optimization-oriented U-Net methods, our PL-Net achieves better performance with no additional trainable parameters. We provide a standard PL-Net (15.03 M) and a smaller version, PL-Net † (3.77 M). The experimental results on three public medical image segmentation datasets show that these two models are very competitive with other state-of-the-art methods in both qualitative and quantitative analysis.
In future research, we plan to study the construction of end-to-end models for different medical image segmentation tasks and focus on designing full-resolution image segmentation methods while maintaining their ability to generate accurate segmentation masks.In addition, we should pay more attention to the ability of PL-Net to maintain good generalization in different data scales, image resolutions and analysis tasks.
Fig 2 (b) shows a sample image of the Nuclei dataset and the corresponding ground truth.

Figure 2: (a)-(c) are the sample images and corresponding ground truth of ISIC 2017, Nuclei and PH2 datasets.

Figure 3: Test IoU and Dice vs. Ocs on three public datasets. The results are computed with 5 runs, shown with standard error. We mark the number of parameters of each model at the top of the bar charts.

Table 2: Ablative results. IoU (Dice), number of parameters, and model size are reported.

Table 3: Comparison of segmentation methods on the ISIC 2017 test set and the PH2 dataset. Red, Green, and Blue indicate the best, second best and third best performance.

Table 4: Comparison of segmentation methods on the Nuclei dataset. Red, Green, and Blue indicate the best, second best and third best performance. For methods run from their original implementations, we report mean ± standard deviation.